
    End-to-end Phoneme Sequence Recognition using Convolutional Neural Networks

    Most state-of-the-art phoneme recognition systems rely on classical neural network classifiers fed with highly tuned features, such as MFCC or PLP features. Recent advances in "deep learning" approaches have called such systems into question, but while some attempts were made with simpler features such as spectrograms, state-of-the-art systems still rely on MFCCs. This might be viewed as a kind of failure of deep learning approaches, which are often claimed to have the ability to train on raw signals, alleviating the need for hand-crafted features. In this paper, we investigate a convolutional neural network approach for raw speech signals. While convolutional architectures have achieved tremendous success in computer vision and text processing, they seem to have been neglected in recent years in the speech processing field. We show that it is possible to learn an end-to-end phoneme sequence classifier directly from the raw signal, with performance on the TIMIT and WSJ datasets similar to that of existing MFCC-based systems, questioning the need for complex hand-crafted features on large datasets. Comment: NIPS Deep Learning Workshop, 201
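    The raw-waveform-to-frame-scores idea described above can be illustrated with a small convolutional stack over raw samples. The following is a minimal PyTorch sketch under assumed layer sizes and a generic 39-phoneme output; it is not the architecture from the paper, only the general principle of learning filterbank-like kernels directly from the signal.

```python
# Minimal sketch: a 1D CNN mapping raw waveform windows to per-frame phoneme
# scores. Layer sizes and the 39-phoneme output are illustrative assumptions,
# not the architecture from the paper.
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    def __init__(self, n_phonemes=39):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=250, stride=10),  # learn filterbank-like kernels
            nn.ReLU(),
            nn.MaxPool1d(3),
            nn.Conv1d(80, 100, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Conv1d(100, n_phonemes, kernel_size=1)  # frame-wise scores

    def forward(self, wav):                           # wav: (batch, 1, samples)
        return self.classifier(self.features(wav))    # (batch, n_phonemes, frames)

# Example: two 1-second clips of 16 kHz audio -> sequences of phoneme scores
scores = RawSpeechCNN()(torch.randn(2, 1, 16000))
print(scores.shape)
```

    In the paper's setting, such frame-level scores would feed a sequence-level decoder; the sketch stops at the frame scores.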

    Using auxiliary sources of knowledge for automatic speech recognition

    Standard hidden Markov model (HMM) based automatic speech recognition (ASR) systems usually use cepstral features as acoustic observations and phonemes as subword units. The speech signal exhibits a wide range of variability, for instance due to environmental variation and speaker variation. This leads to different kinds of mismatch, such as mismatch between acoustic features and acoustic models, or between acoustic features and pronunciation models (given the acoustic models). The main focus of this work is on integrating auxiliary knowledge sources into standard ASR systems so as to make the acoustic models more robust to the variabilities in the speech signal. We refer to sources of knowledge that are able to provide additional information about the sources of variability as auxiliary sources of knowledge. The auxiliary knowledge sources primarily investigated in the present work are auxiliary features and auxiliary subword units.

    Auxiliary features are a secondary source of information outside the standard cepstral features. They can be estimated from the speech signal (e.g., pitch frequency, short-term energy and rate-of-speech) or obtained from additional measurements (e.g., articulator positions or visual information). They are correlated with the standard acoustic features, and can thus aid in estimating better acoustic models, which are more robust to the variabilities present in the speech signal. The auxiliary features investigated are pitch frequency, short-term energy and rate-of-speech. These features can be modelled in standard ASR either by concatenating them to the standard acoustic feature vectors or by using them to condition the emission distribution (as done in gender-based acoustic modelling); a minimal sketch of the concatenation option appears after this entry. We have studied these two approaches within the framework of hybrid HMM/artificial neural network based ASR, dynamic Bayesian network based ASR and the TANDEM system on different ASR tasks. Our studies show that modelling auxiliary features along with the standard acoustic features improves the performance of the ASR system in both clean and noisy conditions. We have also proposed an approach to evaluate the adequacy of the baseform pronunciation model of words. This approach allows us to compare different acoustic models as well as to extract pronunciation variants. Through the proposed approach to evaluating baseform pronunciation models, we show that the matching and discriminative properties of a single baseform pronunciation can be improved by integrating auxiliary knowledge sources in standard ASR.

    Standard ASR systems usually use phonemes as the subword units in a Markov chain to model words. In the present thesis, we also study a system where word models are described by two parallel chains of subword units: one for phonemes and the other for graphemes (phoneme-grapheme based ASR). Models for both types of subword units are jointly learned using maximum likelihood training. During recognition, decoding is performed using either or both of the subword unit chains. In doing so, we use graphemes as auxiliary subword units. The main advantage of using graphemes is that word models can be defined easily from the orthographic transcription, and are thus relatively noise-free compared to word models based upon phoneme units. At the same time, there are drawbacks to using graphemes as subword units, since the grapheme-to-phoneme correspondence is weak in languages such as English. Experimental studies conducted for American English on different ASR tasks have shown that the proposed phoneme-grapheme based ASR system can perform better than the standard ASR system that uses only phonemes as its subword units. Furthermore, while modelling context-dependent graphemes (similar to context-dependent phonemes), we observed that context-dependent graphemes behave like phonemes. ASR studies conducted on different tasks showed that by modelling context-dependent graphemes only (without any phonetic information), performance competitive with the state-of-the-art context-dependent phoneme-based ASR system can be obtained.
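    As a concrete illustration of the feature-concatenation option mentioned above, the snippet below appends per-frame auxiliary features to cepstral feature vectors. It is a minimal sketch with illustrative array shapes; the feature extraction itself (MFCCs, pitch, energy) is assumed to happen elsewhere.

```python
# Minimal sketch of the concatenation option: append auxiliary features
# (here pitch and short-term energy, assumed to be computed per frame
# elsewhere) to the standard cepstral feature vectors before acoustic
# modelling. Shapes and values are illustrative.
import numpy as np

def concat_auxiliary(cepstra, pitch, energy):
    """cepstra: (n_frames, n_ceps); pitch, energy: (n_frames,)"""
    aux = np.stack([pitch, energy], axis=1)           # (n_frames, 2)
    return np.concatenate([cepstra, aux], axis=1)     # (n_frames, n_ceps + 2)

frames = concat_auxiliary(np.random.randn(300, 13),    # 13 MFCCs per frame
                          np.random.rand(300) * 200,   # pitch in Hz
                          np.random.rand(300))         # normalised energy
print(frames.shape)                                    # (300, 15)
```

    The alternative option, conditioning the emission distribution on the auxiliary feature, would instead keep the cepstral vector intact and select or weight acoustic model parameters according to the auxiliary value.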

    Can Self-Supervised Neural Representations Pre-Trained on Human Speech distinguish Animal Callers?

    Self-supervised learning (SSL) models use only the intrinsic structure of a given signal, independent of its acoustic domain, to extract essential information from the input to an embedding space. This implies that the utility of such representations is not limited to modeling human speech alone. Building on this understanding, this paper explores the cross-transferability of SSL neural representations learned from human speech to analyze bio-acoustic signals. We conduct a caller discrimination analysis and a caller detection study on Marmoset vocalizations using eleven SSL models pre-trained with various pretext tasks. The results show that the embedding spaces carry meaningful caller information and can successfully distinguish the individual identities of Marmoset callers without fine-tuning. This demonstrates that representations pre-trained on human speech can be effectively applied to the bio-acoustics domain, providing valuable insights for future investigations in this field. Comment: Accepted at Interspeech 202
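    The cross-transfer setup can be sketched as follows: embed each vocalization with an SSL model pre-trained on human speech, without fine-tuning, and test whether a simple classifier separates callers. The sketch below uses wav2vec 2.0 via torchaudio purely as an assumed stand-in for the eleven models studied in the paper.

```python
# Minimal sketch: extract utterance-level embeddings from an SSL model
# pre-trained on human speech and use them for caller discrimination.
# wav2vec 2.0 via torchaudio is an assumed stand-in for the paper's models.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
model = bundle.get_model().eval()

def embed(waveform, sample_rate):
    """waveform: (1, num_samples). Returns a mean-pooled last-layer embedding."""
    if sample_rate != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)
    with torch.inference_mode():
        features, _ = model.extract_features(waveform)   # list of layer outputs
    return features[-1].mean(dim=1).squeeze(0)           # (feature_dim,)

emb = embed(torch.randn(1, 48000), 48000)                # e.g. a 1 s clip at 48 kHz
print(emb.shape)

# A linear classifier (e.g. logistic regression) trained on such embeddings,
# with caller identity as the label, gives the discrimination score.
```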

    Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering

    Voice activity detection (VAD) is an important pre-processing step for speech technology applications. The task consists of deriving segment boundaries of audio signals which contain voicing information. In recent years, it has been shown that voice source and vocal tract system information can be extracted using zero-frequency filtering (ZFF) without making any explicit model assumptions about the speech signal. This paper investigates the potential of zero-frequency filtering for jointly modeling voice source and vocal tract system information, and proposes two approaches for VAD. The first approach demarcates voiced regions using a composite signal composed of different zero-frequency filtered signals. The second approach feeds the composite signal as input to the rVAD algorithm. These approaches are compared with other supervised and unsupervised VAD methods in the literature, and are evaluated on the Aurora-2 database across a range of SNRs (20 to -5 dB). Our studies show that the proposed ZFF-based methods perform comparably to state-of-the-art VAD methods and are more robust to added degradation and differing channel characteristics. Comment: Accepted at Interspeech 202
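    For reference, zero-frequency filtering itself can be sketched in a few lines: difference the signal, pass it through a cascade of two zero-frequency resonators (equivalent to four cumulative sums), and remove the slowly varying trend with a local-mean window of roughly the average pitch period. The window length and the repeated mean subtraction below are assumptions of this sketch, not the exact configuration used in the paper.

```python
# Minimal sketch of zero-frequency filtering (ZFF). Parameters (10 ms
# trend-removal window, three mean-subtraction passes) are assumptions.
import numpy as np

def zero_frequency_filter(speech, fs, win_ms=10.0):
    x = np.diff(speech, prepend=speech[0])       # remove DC / low-frequency bias
    y = x
    for _ in range(4):                           # two zero-frequency resonators,
        y = np.cumsum(y)                         # each a double pole at z = 1
    win = max(3, int(fs * win_ms / 1000) | 1)    # odd-length local-mean window
    kernel = np.ones(win) / win
    for _ in range(3):                           # successive trend removal
        y = y - np.convolve(y, kernel, mode="same")
    return y

fs = 8000
t = np.arange(fs) / fs
zff = zero_frequency_filter(np.sin(2 * np.pi * 120 * t), fs)
print(zff.shape)
```

    The strength of the resulting signal around its zero crossings is one of the cues such VAD approaches build on.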

    Pronunciation models and their evaluation using confidence measures

    In this report, we present preliminary experiments towards the automatic inference and evaluation of pronunciation models based on multiple utterances of each lexicon word and their given baseline pronunciation model (baseform phonetic transcription). In the present system, the pronunciation models are extracted by decoding each of the training utterances through a series of hidden Markov models (HMMs), first initialized to allow only the generation of the baseline transcription and then iteratively relaxed to converge to a truly ergodic HMM. Each of the generated pronunciation models is then evaluated based on its confidence measure and its Levenshtein distance to the baseform model. The goal of this study is twofold. First, we show that this approach is appropriate for generating robust pronunciation variants. Second, we intend to use this approach to optimize these pronunciation models, by modifying/extending the acoustic features, to increase their confidence scores. In other words, while classical pronunciation modeling approaches usually attempt to make the models more and more complex to capture pronunciation variability, we intend to fix the pronunciation models and optimize the acoustic parameters to maximize their matching and discriminant properties.
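    The Levenshtein distance used to compare a generated variant against the baseform is the standard edit distance over phone symbols; a minimal sketch with an illustrative word follows.

```python
# Minimal sketch of the Levenshtein (edit) distance between a pronunciation
# variant and the baseform transcription. Phone symbols are illustrative.
def levenshtein(ref, hyp):
    """Minimum number of substitutions, insertions and deletions."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)]

baseform = ["t", "ah", "m", "ey", "t", "ow"]    # "tomato"
variant  = ["t", "ah", "m", "aa", "t", "ow"]
print(levenshtein(baseform, variant))            # 1
```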

    Integrating articulatory features using Kullback-Leibler divergence based acoustic model for phoneme recognition

    In this paper, we propose a novel framework to integrate articulatory features (AFs) into an HMM-based ASR system. This is achieved by using posterior probabilities of different AFs (estimated by multilayer perceptrons) directly as observation features in a Kullback-Leibler divergence based HMM (KL-HMM) system. On the TIMIT phoneme recognition task, the proposed framework yields a phoneme recognition accuracy of 72.4%, which is comparable to the KL-HMM system using posterior probabilities of phonemes as features (72.7%). Furthermore, a best performance of 73.5% phoneme recognition accuracy is achieved by jointly modeling AF probabilities and phoneme probabilities as features. This shows the efficacy and flexibility of the proposed approach. Index Terms: automatic speech recognition, articulatory features, phonemes, multilayer perceptrons, Kullback-Leibler divergence based hidden Markov model, posterior probabilities
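    The KL-HMM local score underlying this framework can be illustrated with a toy example: each state stores a categorical distribution over the posterior-feature dimensions, and a frame's MLP posterior vector is scored by its KL divergence from that distribution. The numbers and the divergence direction below are illustrative assumptions.

```python
# Minimal sketch of a KL-HMM local score: the divergence between a state's
# categorical distribution and one frame of MLP posteriors (here over four
# hypothetical articulatory-feature classes). Values are illustrative.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

state_dist = np.array([0.70, 0.20, 0.05, 0.05])   # state's categorical distribution
frame_post = np.array([0.60, 0.25, 0.10, 0.05])   # MLP posteriors for one frame

# A smaller divergence means a better match between state and observation.
print(kl_divergence(state_dist, frame_post))
```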

    Combining Acoustic Data Driven G2P and Letter-to-Sound Rules for Under Resource Lexicon Generation

    In a recent work, we proposed an acoustic data-driven grapheme-to-phoneme (G2P) conversion approach, where the probabilistic relationship between graphemes and phonemes learned from acoustic data is used along with the orthographic transcription of words to infer the phoneme sequence. In this paper, we extend our studies to the under-resourced lexicon development problem. More precisely, given a small amount of transcribed speech data consisting of a few words along with their pronunciation lexicon, the goal is to build a pronunciation lexicon for unseen words. In this framework, we compare our G2P approach with a standard letter-to-sound (L2S) rule-based conversion approach. We evaluate the generated lexicons on the PhoneBook 600-word task in terms of pronunciation errors and ASR performance. The G2P approach yields a best ASR performance of 14.0% word error rate (WER), while the L2S approach yields a best ASR performance of 13.7% WER. A combination of the G2P and L2S approaches yields a best ASR performance of 9.3% WER.

    A Sector-Based, Frequency-Domain Approach to Detection and Localization of Multiple Speakers

    Detection and localization of speakers with microphone arrays is a difficult task due to the wideband nature of speech signals, the large amount of overlap between speakers in spontaneous conversations, and the presence of noise sources. Many existing audio multi-source localization methods rely on prior knowledge of the sectors containing active sources and/or the number of active sources. This paper proposes sector-based, frequency-domain approaches that address both detection and localization problems by measuring relative phases between microphones. The first approach is similar to delay-sum beamforming. The second approach is novel: it relies on systematic optimization of a centroid in phase space, for each sector. It provides a major, systematic improvement over the first approach as well as over previous work. Very good results are obtained on more than one hour of recordings in real meeting room conditions, including cases with up to three concurrent speakers.
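    The first, delay-sum-like approach can be sketched by phase-aligning microphone spectra for a candidate set of propagation delays and measuring the power of their sum; a sector would then be scored by scanning the delays it covers. The geometry, signals and parameters below are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of frequency-domain delay-and-sum steering: compensate a
# candidate set of arrival delays (one per microphone) in the frequency
# domain and measure the power of the summed, aligned spectra.
import numpy as np

def steered_power(frames, delays, fs):
    """frames: (n_mics, n_samples); delays: candidate arrival delays in seconds."""
    n_mics, n = frames.shape
    spectra = np.fft.rfft(frames, axis=1)                    # (n_mics, n_bins)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)                   # bin frequencies in Hz
    comp = np.exp(2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    aligned = spectra * comp                                 # undo the candidate delays
    return float(np.sum(np.abs(aligned.sum(axis=0)) ** 2))   # beamformer output power

fs = 16000
t = np.arange(1024) / fs
tau = 2e-4                                                   # mic 1 receives the wave 0.2 ms later
mics = np.stack([np.sin(2 * np.pi * 800 * t),
                 np.sin(2 * np.pi * 800 * (t - tau))])

# Steering with the true delay gives clearly higher power than a wrong one.
print(steered_power(mics, [0.0, tau], fs), steered_power(mics, [0.0, -tau], fs))
```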